Refactor filterByTainted#27584
Conversation
7e36f94 to
bb1743b
Compare
bb1743b to
f838bfc
Compare
f838bfc to
fbd20af
Compare
|
Everytime I see |
|
why Nodes? |
gulducat
left a comment
There was a problem hiding this comment.
This is fabulous. The code itself looks solid to me. My comments are all about adding even more clarity based on the new structure.
| condition func(allocContext) bool | ||
| category allocCategory |
There was a problem hiding this comment.
suuuper nit-picky, but reversing the order of these both a) is alphabetical, and b) puts the simpler information (category) first. the latter I think makes the already way more readable classification rule slice just that extra little bit more so.
| condition func(allocContext) bool | |
| category allocCategory | |
| category allocCategory | |
| condition func(allocContext) bool |
| condition: func(ctx allocContext) bool { | ||
| return ctx.shouldReconnect && | ||
| ctx.alloc.DesiredStatus == structs.AllocDesiredStatusRun && | ||
| ctx.alloc.ClientStatus == structs.AllocClientStatusFailed | ||
| }, | ||
| category: categoryReconnecting, |
There was a problem hiding this comment.
e.g. per my previous comment
| condition: func(ctx allocContext) bool { | |
| return ctx.shouldReconnect && | |
| ctx.alloc.DesiredStatus == structs.AllocDesiredStatusRun && | |
| ctx.alloc.ClientStatus == structs.AllocClientStatusFailed | |
| }, | |
| category: categoryReconnecting, | |
| category: categoryReconnecting, | |
| condition: func(ctx allocContext) bool { | |
| return ctx.shouldReconnect && | |
| ctx.alloc.DesiredStatus == structs.AllocDesiredStatusRun && | |
| ctx.alloc.ClientStatus == structs.AllocClientStatusFailed | |
| }, |
category is always just the one line, so my brain latches on to it a little easier before the variable-length condition.
(I repeat: this is super nit-picky 😋)
|
|
||
| // classificationRules are evaluated in order; first match wins. | ||
| var classificationRules = []classificationRule{ | ||
| // Failed allocs that need reconnect and are still desired to run. |
There was a problem hiding this comment.
| // Failed allocs that need reconnect and are still desired to run. | |
| // Failed allocs that need to reconnect and are still desired to run. |
| }, | ||
| category: categoryExpiring, | ||
| }, | ||
| // Failed reconnects that server already marked to stop should be ignored. |
There was a problem hiding this comment.
| // Failed reconnects that server already marked to stop should be ignored. | |
| // Failed reconnects that the server already marked to stop should be ignored. |
I assume this refers to the nomad server/leader?
| // Disconnected node and alloc already unknown. | ||
| { | ||
| condition: func(ctx allocContext) bool { | ||
| return ctx.taintedNode != nil && | ||
| ctx.taintedNode.Status == structs.NodeStatusDisconnected && | ||
| ctx.alloc.ClientStatus == structs.AllocClientStatusUnknown | ||
| }, | ||
| category: categoryUntainted, |
There was a problem hiding this comment.
A bonus from these conditions being so well-contained is that it's easier for me to isolate where wording confuses me.
This is saying "Sure, the node is tainted (can't handle allocs), but the alloc has the appropriate status, so it's fine -- we don't need to do anything with it." Right? I.e. the alloc already in the state that we want/expect it to be.
Short of renaming untainted, maybe the comment could elaborate on this a bit?
| // Disconnected node and alloc already unknown. | |
| { | |
| condition: func(ctx allocContext) bool { | |
| return ctx.taintedNode != nil && | |
| ctx.taintedNode.Status == structs.NodeStatusDisconnected && | |
| ctx.alloc.ClientStatus == structs.AllocClientStatusUnknown | |
| }, | |
| category: categoryUntainted, | |
| // Disconnected node, but alloc already marked unknown, so no changes needed. | |
| { | |
| category: categoryUntainted, | |
| condition: func(ctx allocContext) bool { | |
| return ctx.taintedNode != nil && | |
| ctx.taintedNode.Status == structs.NodeStatusDisconnected && | |
| ctx.alloc.ClientStatus == structs.AllocClientStatusUnknown | |
| }, |
or something like that?
| alloc: func() *structs.Allocation { | ||
| a := makeAlloc("c3", "ready", structs.AllocClientStatusComplete, structs.AllocDesiredStatusRun) | ||
| a.DeploymentStatus = &structs.AllocDeploymentStatus{Canary: true} | ||
| a.DesiredTransition = structs.DesiredTransition{Migrate: pointer.Of(true)} |
There was a problem hiding this comment.
mainly to avoid issues after #27875 that removes pointer.Of
| a.DesiredTransition = structs.DesiredTransition{Migrate: pointer.Of(true)} | |
| a.DesiredTransition = structs.DesiredTransition{Migrate: new(true)} |
| name: "migrate flag", | ||
| alloc: func() *structs.Allocation { | ||
| a := makeAlloc("c11", "ready", structs.AllocClientStatusPending, structs.AllocDesiredStatusRun) | ||
| a.DesiredTransition = structs.DesiredTransition{Migrate: pointer.Of(true)} |
There was a problem hiding this comment.
| a.DesiredTransition = structs.DesiredTransition{Migrate: pointer.Of(true)} | |
| a.DesiredTransition = structs.DesiredTransition{Migrate: new(true)} |
| testCases := []struct { | ||
| name string | ||
| alloc *structs.Allocation | ||
| nodeMap map[string]*structs.Node |
There was a problem hiding this comment.
I don't think we need this field; every test case passes in nodes for this value.
| nodeMap map[string]*structs.Node |
| for _, tc := range testCases { | ||
| t.Run(tc.name, func(t *testing.T) { | ||
| all := allocSet{tc.alloc.ID: tc.alloc} | ||
| state := ClusterState{Now: now, TaintedNodes: tc.nodeMap} |
There was a problem hiding this comment.
related to other comment,
| state := ClusterState{Now: now, TaintedNodes: tc.nodeMap} | |
| state := ClusterState{Now: now, TaintedNodes: nodes} |
| all := allocSet{tc.alloc.ID: tc.alloc} | ||
| state := ClusterState{Now: now, TaintedNodes: tc.nodeMap} | ||
|
|
||
| untainted, migrate, lost, disconnecting, reconnecting, ignore, expiring := all.filterByTainted(state) |
There was a problem hiding this comment.
I'll put this comment here so github makes it a thread to discuss, only tangentially related to this line.
I agree with @mismithhisler that we could rename this func, because filterByTainted is not very intuitive, but you're right that it's not Nodes that are being categorized. How about categorizeAllocs?
I also think the doc comment could be a little clearer, while we're at it, something like
// categorizeAllocs takes a set of tainted nodes and filters the allocation set
// into the following groups:
// 1. untainted: allocs on untainted nodes (AND clarify what exactly this means...)
// 2. migrate: allocs on nodes that are draining and need to be moved
// 3. lost: allocs on lost nodes or have expired and need to be replaced
// 4. disconnect: allocs on nodes that are disconnected, but have not had their ClientState set to unknown
// 5. reconecting: allocs on a node that has reconnected, and need ......
// 6. ignore: allocs in a state that results in a noop that only need to be counted
// 7. expiring: allocs on disconnected nodes and need to be marked lost (and possibly replaced)
func (set allocSet) categorizeAllocs(state ClusterState)
(untainted, migrate, lost, disconnecting, reconnecting, ignore, expiring allocSet) {?
There was a problem hiding this comment.
Description
This PR attempts to refactor de function
filterByTaintedto make it clearer and easier to expand. It maintains all its functionality and adds testing.The approach is to have a priority organised classification rules of the shape
condition func(allocContext) boolthat will determine in with bucket should each allocation end up.The
allocContextcontains all the information necessary to classify an alloc:Testing & Reproduction steps
Links
Contributor Checklist
changelog entry using the
make clcommand.ensure regressions will be caught.
and job configuration, please update the Nomad product documentation, which is stored in the
web-unified-docsrepo. Refer to theweb-unified-docscontributor guide for docs guidelines.Please also consider whether the change requires notes within the upgrade
guide. If you would like help with the docs, tag the
nomad-docsteam in this PR.Reviewer Checklist
backporting document.
in the majority of situations. The main exceptions are long-lived feature branches or merges where
history should be preserved.
within the public repository.
Changes to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.